feat: improve maintainers detection [CM-1033] #3908

Merged
mbani01 merged 18 commits into main from feat/improve_maintainer_file_detection
Mar 26, 2026

Conversation

@mbani01 mbani01 commented Mar 10, 2026

What changed

Before

  • File discovery was a sequential scan of a hard-coded flat list (MAINTAINER_FILES: 13 entries, root-only, no recursion).
  • The first matching file was used — no ranking, no scoring, no fallback strategy.
  • README.md was in the candidate list and required a simple content check for the word "maintainer".
  • AI file-selection received a plain list of filenames with no signals to rank them.
  • extract_maintainers always started from scratch — no reuse of a previously found file.
  • compare_and_update_maintainers skipped all maintainers with github_username == "unknown", including those with a valid email; no email fallback for identity lookup.
  • candidate_files and ai_suggested_file did not exist in MaintainerResult or execution metrics.
  • The full-content AI extraction prompt was always built upfront, even when the content was going to be chunked.

After

Detection pipeline (4-step with fallback)

  1. Saved file reuse — if a maintainer file was found on a previous run, it is tried first before any scanning.
  2. Ripgrep recursive search + scoring: rg scans the full repo for files matching 20 governance stems (MAINTAINERS, OWNERS, CODEOWNERS, GOVERNANCE, EMERITUS, etc.) across all depths and valid extensions. Each file is scored: exact known path (100), exact stem match (50), partial stem (25), plus +1 per governance keyword found in content. All candidates are returned sorted by score; the top one is analyzed.
  3. README guard — README files are rejected immediately (no AI call) unless their content contains the word maintainer.
  4. AI file-selection fallback — if the top candidate fails, the full repo file list is scanned, pre-filtered to governance-scored files (capped at 300) with the already-failed file excluded, and passed to AI as (filename, score) tuples. The prompt instructs the model to prefer higher scores, shallower paths, and to reject files inside vendor/, node_modules/, third_party/, external/, and similar third-party directories.
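
The scoring rule in step 2 can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the contents of `KNOWN_PATHS`, `GOVERNANCE_STEMS`, and `GOVERNANCE_KEYWORDS` are placeholder values, and the sketch assumes the filename tiers are mutually exclusive, that each distinct keyword counts once, and that ties are broken in favour of shallower paths.

```python
from __future__ import annotations

import os

# Placeholder tables for illustration only; the PR's real lists are longer.
KNOWN_PATHS = {"maintainers.md", "codeowners", ".github/codeowners"}
GOVERNANCE_STEMS = {"maintainers", "owners", "codeowners", "governance", "emeritus"}
GOVERNANCE_KEYWORDS = ("maintainer", "reviewer", "approver")  # assumed keyword list


def score_candidate(path: str, content: str) -> int:
    """Score one file: a filename tier plus one point per keyword present."""
    normalized = path.lower()
    stem = os.path.splitext(os.path.basename(normalized))[0]

    if normalized in KNOWN_PATHS:
        score = 100  # exact known path
    elif stem in GOVERNANCE_STEMS:
        score = 50  # exact stem match
    elif any(known in stem for known in GOVERNANCE_STEMS):
        score = 25  # partial stem match
    else:
        score = 0

    text = content.lower()
    # Assumption: each distinct keyword adds one point, however often it occurs.
    score += sum(1 for keyword in GOVERNANCE_KEYWORDS if keyword in text)
    return score


def rank_candidates(files: dict[str, str]) -> list[tuple[str, int]]:
    """Return (path, score) tuples, highest score first, shallower paths first on ties."""
    scored = [(path, score_candidate(path, content)) for path, content in files.items()]
    scored = [(path, score) for path, score in scored if score > 0]
    return sorted(scored, key=lambda item: (-item[1], item[0].count("/"), item[0]))
```

With these placeholder tables, a root-level `MAINTAINERS.md` outranks a nested `OWNERS` file, and a file with no filename or keyword match is dropped from the candidate list.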

Bug fixes

  • compare_and_update_maintainers: the skip guard now only fires when both github_username and email are unknown/None (previously skipped all "unknown" usernames unconditionally). New maintainers identified by email now go through find_maintainer_identity_by_email as a fallback, matching insert_new_maintainers behaviour.
  • Extraction prompt for chunked content is now built lazily inside the else branch, avoiding a wasted string allocation on every large file.
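
The corrected skip guard can be illustrated with a small sketch. The helper names below (`is_unknown`, `should_skip_maintainer`, `lookup_strategy`) are hypothetical and do not appear in the PR; treating an empty string as unknown is also an assumption.

```python
from __future__ import annotations


def is_unknown(value: str | None) -> bool:
    """Treat None, blank strings, and the literal "unknown" as missing (assumption)."""
    return value is None or value.strip().lower() in ("", "unknown")


def should_skip_maintainer(github_username: str | None, email: str | None) -> bool:
    """Skip only when neither the username nor the email can identify the person."""
    return is_unknown(github_username) and is_unknown(email)


def lookup_strategy(github_username: str | None, email: str | None) -> str | None:
    """Pick how to resolve the identity: by username first, by email as a fallback."""
    if not is_unknown(github_username):
        return "github"
    if not is_unknown(email):
        return "email"  # the case the old guard used to drop
    return None
```

Under the previous behaviour, a record such as `("unknown", "a@b.org")` would have been skipped; in this sketch it resolves to the email strategy instead.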

Observability

  • MaintainerResult gains candidate_files: list[tuple[str, int]] and ai_suggested_file: str | None.
  • ServiceExecution metrics now record candidate_files (top-100 by score) and ai_suggested_file on every run.
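
As a rough picture of the new fields, the sketch below uses a plain dataclass as a simplified stand-in for `MaintainerResult`. Only `candidate_files` and `ai_suggested_file` come from the PR description; the `maintainer_file` field, the `to_metrics` helper, and the constant name are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

METRICS_CANDIDATE_LIMIT = 100  # hypothetical name for the "top-100 by score" cap


@dataclass
class MaintainerResult:
    """Simplified stand-in; the real model carries more fields."""

    maintainer_file: str | None = None  # assumed field, for context only
    candidate_files: list[tuple[str, int]] = field(default_factory=list)
    ai_suggested_file: str | None = None


def to_metrics(result: MaintainerResult) -> dict:
    """Build the metrics payload, keeping only the highest-scored candidates."""
    top = sorted(result.candidate_files, key=lambda item: item[1], reverse=True)
    return {
        "candidate_files": top[:METRICS_CANDIDATE_LIMIT],
        "ai_suggested_file": result.ai_suggested_file,
    }
```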

Note

Medium Risk
Moderate risk because it refactors the maintainer discovery pipeline (new ripgrep-based scanning, scoring, and AI fallback) and changes how identities are resolved for "unknown" usernames, which can affect maintainer ingestion and run-time behavior.

Overview
Improves maintainer file discovery by replacing the hard-coded filename scan with a multi-step detection pipeline: reuse the previously saved file, recursively search repo files via ripgrep with filename/content scoring and depth preference, and fall back to AI selection using scored candidates (with explicit third-party directory exclusions).
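
The fallback order described above might look something like the sketch below. It is a simplification under stated assumptions: `read_text`, `extract`, and `ai_select` are hypothetical callables standing in for file reading, AI extraction, and AI file selection, and the AI step reuses the ranked candidates rather than rescanning the full repo file list as the PR does.

```python
from __future__ import annotations

from typing import Callable, Optional

MAX_AI_FILE_LIST_SIZE = 300  # cap quoted in the PR


def readme_passes_guard(path: str, content: str) -> bool:
    """README files are only analyzed when they mention 'maintainer'."""
    name = path.rsplit("/", 1)[-1].lower()
    if not name.startswith("readme"):
        return True
    return "maintainer" in content.lower()


def detect_maintainer_file(
    saved_file: Optional[str],
    ranked_candidates: list[tuple[str, int]],
    read_text: Callable[[str], str],
    extract: Callable[[str, str], bool],
    ai_select: Callable[[list[tuple[str, int]]], Optional[str]],
) -> Optional[str]:
    """Return the first file that yields maintainers, or None."""
    failed: set[str] = set()

    def try_file(path: str) -> bool:
        content = read_text(path)
        # README guard: reject before spending an extraction call.
        if readme_passes_guard(path, content) and extract(path, content):
            return True
        failed.add(path)
        return False

    # Step 1: reuse the file saved by a previous run.
    if saved_file and try_file(saved_file):
        return saved_file

    # Step 2: analyze the top-scored candidate.
    if ranked_candidates:
        top = ranked_candidates[0][0]
        if top not in failed and try_file(top):
            return top

    # Step 4: AI selection over the remaining scored files, failed ones excluded.
    remaining = [(p, s) for p, s in ranked_candidates if p not in failed]
    remaining = remaining[:MAX_AI_FILE_LIST_SIZE]
    suggestion = ai_select(remaining) if remaining else None
    if suggestion and try_file(suggestion):
        return suggestion
    return None
```

Passing the steps in as callables keeps the ordering testable without a model or a repository on disk.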

Extends observability by recording candidate_files and ai_suggested_file on MaintainerResult and persisting them in ServiceExecution metrics, and adjusts maintainer upsert logic to only skip truly unknown identities and to fall back to email-based identity lookup when github_username is "unknown". Also adds ripgrep to the git-integration Docker runner image to support the new detection.

Written by Cursor Bugbot for commit 56454b9.

@mbani01 mbani01 self-assigned this Mar 10, 2026
Copilot AI review requested due to automatic review settings March 10, 2026 15:41
Copilot AI left a comment

Pull request overview

This PR improves maintainer file detection in the git integration service by adding a multi-step discovery and analysis flow that combines static filename matching, dynamic ripgrep-based content search, and an AI fallback, while also surfacing more metadata about what was tried.

Changes:

  • Added ripgrep-based repo scanning (rg --files and keyword search) with fallback to os.walk, plus scoring/filtering of dynamic candidates.
  • Refactored maintainer extraction to prioritize a previously saved maintainer file, then analyze top candidates, then use AI file suggestion as a last resort.
  • Extended MaintainerResult and service execution metrics to include candidate_files and ai_suggested_file; added ripgrep to the Docker image.
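
A file listing with `rg --files` and an `os.walk` fallback could be sketched as below. The function name `list_repo_files` is hypothetical and the PR's implementation may differ; note that ripgrep skips hidden and ignored files by default, whereas this fallback only skips `.git`, so the two paths are not guaranteed to return identical lists.

```python
from __future__ import annotations

import os
import subprocess


def list_repo_files(repo_path: str) -> list[str]:
    """List repo-relative file paths, preferring ripgrep and falling back to os.walk."""
    try:
        proc = subprocess.run(
            ["rg", "--files"],
            cwd=repo_path,
            capture_output=True,
            text=True,
            check=True,
        )
        paths = proc.stdout.splitlines()
    except (FileNotFoundError, subprocess.CalledProcessError):
        # rg is missing or exited non-zero: walk the tree ourselves.
        paths = []
        for root, dirs, files in os.walk(repo_path):
            dirs[:] = [d for d in dirs if d != ".git"]  # do not descend into .git
            for name in files:
                paths.append(os.path.relpath(os.path.join(root, name), repo_path))

    normalized = []
    for path in paths:
        path = path.replace(os.sep, "/")
        if path.startswith("./"):
            path = path[2:]
        normalized.append(path)
    return sorted(normalized)
```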

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py | New candidate discovery + fallback extraction flow; logs and metrics now include candidate/AI-suggested file metadata. |
| services/apps/git_integration/src/crowdgit/models/maintainer_info.py | Adds new result metadata fields (candidate_files, ai_suggested_file). |
| scripts/services/docker/Dockerfile.git_integration | Installs ripgrep in the runner image to support dynamic search. |

@mbani01 mbani01 requested a review from joanagmaia March 10, 2026 16:42
@mbani01 mbani01 marked this pull request as draft March 10, 2026 17:54
@mbani01 mbani01 force-pushed the feat/improve_maintainer_file_detection branch from bc8e3df to b4dd488 Compare March 11, 2026 14:05
@mbani01 mbani01 marked this pull request as ready for review March 11, 2026 14:34
@joanagmaia joanagmaia left a comment

This looks great. I have a couple of questions and requests to make sure we have some more metrics, given that these are big changes to the current process.

Questions:

  • By picking only one file for analysis, we are assuming that all maintainer information lives in a single file, right? I'm not sure whether we need to guard against losing data because of that.

Requests:

  • Can we run the new mechanism on about 10 repos and check the accuracy? I would also run it against the issues we currently have open on Insights to see whether coverage has improved: https://github.com/linuxfoundation/insights/issues?q=is%3Aissue%20state%3Aopen%20maintainer
  • Can we prepare a monitor in Metaplane that tracks the number of repositories for which we can get maintainer data? And also the number of projects?
  • Can we test using the Haiku model for find_maintainer_file_with_ai, since it is a simpler task than the rest of the work?

MAX_AI_FILE_LIST_SIZE = 300

# Full paths that get the highest score bonus when matched exactly
KNOWN_PATHS = {

We should also include SECURITY-INSIGHTS.md. It was supported before as well.
E.g. https://github.com/open-telemetry/opentelemetry-dotnet/blob/d54379e28c07db783452a33e119f1cdf8e7d96a6/SECURITY-INSIGHTS.yml#L13

}

# Governance stems (basename without extension, lowercased) for filename search
GOVERNANCE_STEMS = {

Should we also add:

My only concern here is that it seems that they use the community repo to manage some maintainers data. So here we might need to infer the repository based on the directory structure. Maybe it's too complex for us to want to support at least for now

mbani01 (Contributor, Author) replied:

It's tricky when the repo and its maintainers data are in different places. I will check how we can support this easily.

@mbani01 mbani01 force-pushed the feat/improve_maintainer_file_detection branch from 5bcd908 to 0b8a57e Compare March 26, 2026 13:36
@mbani01 mbani01 force-pushed the feat/improve_maintainer_file_detection branch from 0b8a57e to 1546f7e Compare March 26, 2026 13:45
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


mbani01 added 10 commits March 26, 2026 14:57
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
… detection

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…rd in content

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…improve prompt

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
mbani01 added 6 commits March 26, 2026 14:57
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…irectories

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 force-pushed the feat/improve_maintainer_file_detection branch from 645633a to bbacdfa Compare March 26, 2026 13:58
@mbani01 mbani01 merged commit 85160df into main Mar 26, 2026
10 checks passed
@mbani01 mbani01 deleted the feat/improve_maintainer_file_detection branch March 26, 2026 14:08
4 participants